The Relationship Between Income and Life Expectancy
Final Project
Author
Hayley C., Felicia P., Ian L., Shane W.
1 Library Imports
2 Data
2.1 Variables
2.2 Daily Mean Income Household Per Capita
GapMinder compiled data on daily mean household income per capita to analyze income distributions over hundreds of years. These figures are anchored in the official Mean Income indicator from the World Bank, derived from household surveys. For countries lacking World Bank data, GapMinder estimated mean income based on GDP per capita. The available time frame for actual World Bank data spans from 1967 to 2021, though most countries have limited data points within these years.
GapMinder used growth rates of constant-dollar GDP per capita to estimate mean incomes back to 1800 and to project them forward to 2100. For the period 1981-2019, they relied on World Bank data, known for its comprehensive coverage, published in the World Development Indicators as “Survey mean consumption or income per capita, total population (2017 PPP dollars per day).” The World Bank Poverty and Inequality Platform (PIP) describes the indicator as “Survey mean/average consumption or income per capita, total population (2017 PPP dollars per day). The mean represents the average monthly household per capita income or consumption expenditure from the survey in 2017 PPP.” Average consumption is not relevant to our project, but the World Bank collects it together with income per capita.
In short, the average daily income is the mean daily household per capita income or consumption expenditure from the survey, expressed in 2017 constant international dollars.
2.2.1 Life Expectancy
GapMinder collects life expectancy data from various sources to create a comprehensive dataset spanning from 1800 to 2100. Life expectancy at birth refers to the average number of years a newborn is expected to live, assuming that current mortality rates remain constant throughout their lifetime.
For the period from 1800 to 1970, GapMinder relies on its own compiled data (version 7), which includes information from over 100 sources and accounts for historical events causing significant mortality dips. From 1950 to 2019, data is primarily sourced from the Global Burden of Disease Study 2019 by the Institute for Health Metrics and Evaluation (IHME). This source provides detailed annual estimates. For projections from 2020 to 2100, GapMinder uses forecasts from the United Nations’ World Population Prospects 2022. The data is carefully combined, prioritizing IHME data when available, and extending IHME series with UN estimates for future projections.
2.3 Hypothesized Relationship Between the Variables
We hypothesize that average daily income is positively associated with life expectancy at birth.
2.4 How the Data was Cleaned
To clean the data, we first checked the type of each column and found that the yearly values were stored as character strings even though they represent numbers. We therefore converted every year column to numeric.
The year column names had an X prefix when the data was first loaded, because read.csv() prepends X to column names that begin with a digit. We removed this prefix after pivoting the data so that the years are easy to reference when graphing.
Rather than dropping NA values at this stage, we kept them so that, when joining the data, we could decide which years or countries to use based on where the two data frames overlap.
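The cleaning step can be sketched on a toy table. This is a minimal illustration, not the project's data: the two countries and their values are made up, and the names `toy_income` and `toy_income_clean` are hypothetical. It assumes R 4.1 or later (for `|>`) and the dplyr package.

```r
library(dplyr)

# Toy stand-in for a GapMinder CSV (made-up values).
# colClasses = "character" mimics year columns that arrive as text.
toy_income <- read.csv(
  text = "country,2009,2010\nA,1.5,1.7\nB,2.0,2.4",
  colClasses = "character"
)

# read.csv() makes the names syntactically valid, so "2009" becomes "X2009".
names(toy_income)

# Convert every column except `country` to numeric.
toy_income_clean <- toy_income |>
  mutate(across(-country, as.numeric))

# Check the storage type of each column after conversion.
sapply(toy_income_clean, typeof)
```

The X prefix is left in place here, matching the order described above: it is removed only after the data has been pivoted.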
2.5 How the Data was Pivoted
Next, we pivoted each data set from wide to long format, keeping country as the identifier, so that each year becomes its own observation. Each row now holds one country, one year, and the corresponding value (average daily income or average life expectancy).
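A small sketch of the pivot on toy data may help. The values below are made up and `toy_wide` and `toy_long` are hypothetical names; the call pattern follows the `pivot_longer()` usage in the project source. It assumes the dplyr, tidyr, and stringr packages.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy wide table: one row per country, one column per year (made-up values).
toy_wide <- data.frame(
  country = c("A", "B"),
  X2009 = c(1.5, 2.0),
  X2010 = c(1.7, NA)
)

# One row per country-year; strip the "X" prefix and store the year as an integer.
toy_long <- toy_wide |>
  pivot_longer(
    cols = -country,
    names_to = "Year",
    values_to = "Average Daily Income"
  ) |>
  mutate(Year = as.integer(str_remove(Year, "X")))

toy_long
```

Two countries and two years give four rows. The NA for country B in 2010 is kept, because `pivot_longer()` does not drop missing values unless `values_drop_na = TRUE` is set.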
2.6 How the Data was Joined
To create a single data table, we joined the two cleaned and pivoted data sets. We used an inner join on country and year, which keeps only the country-year pairs that appear in both data sets and drops pairs that are missing from either one. Values recorded as NA within a matching pair are not removed by the join.
In addition to joining the data, we capitalized the name of the “country” column for consistency with the other variable names.
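The behavior of the join can be checked on toy data. This is a sketch under stated assumptions: the tables and values are made up, the names `toy_income`, `toy_life`, and `toy_joined` are hypothetical, and `join_by()` requires dplyr 1.1.0 or later.

```r
library(dplyr)

# Toy long tables (made-up values).
toy_income <- tibble(
  country = c("A", "A", "B"),
  Year = c(2009L, 2010L, 2010L),
  `Average Daily Income` = c(1.5, 1.7, NA)
)
toy_life <- tibble(
  country = c("A", "B", "C"),
  Year = c(2010L, 2010L, 2010L),
  `Average Life Expectancy` = c(60, 70, 80)
)

# Keep only country-year pairs present in both tables, then rename the key column.
toy_joined <- toy_income |>
  inner_join(toy_life, join_by(country, Year)) |>
  rename(Country = country)

toy_joined
```

Only (A, 2010) and (B, 2010) survive: (A, 2009) has no life expectancy row and country C has no income row. The row for B is kept even though its income is NA, since the join matches on keys rather than on values.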
3 Linear Regression
3.1 Exploring the Relationship Between the Two Variables
The two variables explored are average daily income and average life expectancy. We examine how life expectancy varies with income.
The explanatory variable is the average income and the response variable is the average life expectancy.
To explore how the relationship changes over time, we animated the scatterplot by year.
3.2 Linear Regression
3.2.1 Steps to Choosing Regression Features
To simplify the linear regression, we restricted the data to a single year, 2010. Daily income and life expectancy have both changed substantially over the centuries, which makes it difficult to capture the full extent of these trends in a single regression model.
Historical data from the 1800s to the present day illustrates substantial shifts in both daily income and life expectancy, reflecting changes in economic, social, and healthcare systems globally.
By selecting the year 2010 as a reference point, we aim to focus on a period that represents a modern snapshot of these trends. Here’s why 2010 is a good choice:
1. Representative Modern Era: 2010 serves as a representative point in the modern era, offering insight into contemporary socioeconomic and health conditions across countries.
2. Mitigation of Predicted Data: The later years of both series are projections rather than observations. Using 2010 keeps the analysis on observed data and avoids biases that projected values could introduce.
3. Adequate Time for Analysis: With 14 years having passed since 2010, this timeframe provides sufficient data for analysis while minimizing the impact of short-term fluctuations that may occur within smaller time intervals.
By anchoring our analysis to the year 2010, we aim to capture meaningful trends in daily income and life expectancy while ensuring the reliability and relevance of our linear regression model.
3.2.2 Regression Code
The fitted regression line is \(\hat{y} = 0.2669x + 65.1162\), where \(x\) is the daily income in 2010 and \(\hat{y}\) is the predicted life expectancy in 2010.
3.2.3 Interpretation of Coefficients
Intercept (65.1162): The intercept is the estimated life expectancy in 2010 when daily income is zero. This value is an extrapolation, since a mean daily income of zero does not occur in practice, so it is best read as the baseline that anchors the regression line rather than as a prediction for a real country.
Daily Income Coefficient (0.2669): The slope is the estimated change in life expectancy associated with a one-unit increase in daily income. Here, each additional 2017 PPP dollar of daily income is associated with an increase of 0.2669 years in life expectancy, on average. Because the model has a single predictor, no other variables are held constant.
These interpretations provide insights into the relationship between daily income and life expectancy in the year 2010, as captured by the estimated regression model.
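To make the coefficients concrete, the fitted line can be evaluated at a few income levels. This is only an illustration: the function name `predict_life_expectancy` is hypothetical, and the incomes of 5, 10, and 50 dollars per day are chosen for the example rather than taken from the data.

```r
# Fitted line from the 2010 model: y-hat = 0.2669 * x + 65.1162
predict_life_expectancy <- function(daily_income) {
  0.2669 * daily_income + 65.1162
}

# Predicted life expectancy at three illustrative daily incomes (2017 PPP dollars).
predictions <- predict_life_expectancy(c(5, 10, 50))
predictions  # approximately 66.45, 67.79, 78.46

# A 5-dollar increase changes the prediction by 5 * 0.2669 = 1.3345 years.
diff(predictions[1:2])
```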
3.3 Model Fit
Table: Variance Summary of Regression Model

| Total Variance | Fitted Variance | Residual Variance |
|---|---|---|
| 75.7859 | 31.69877 | 44.08713 |
The total variance in life expectancy is 75.79. Of this, 31.70 is explained by the model, which gives an \(R^2\) of about 41.8% (31.70 / 75.79). The remaining 44.09 is unexplained.
Based on this \(R^2\), the model quality is moderate to poor: the majority of the variability in life expectancy is not explained by daily income.
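The link between the variance table and \(R^2\) can be verified directly. The first part below uses the three values reported in the table. The second part is a sketch on simulated data (made-up values, arbitrary seed) showing that, for a least-squares fit with an intercept, the ratio of fitted to total variance equals the \(R^2\) reported by `summary()`.

```r
# Values reported in the variance summary table.
total_variance <- 75.7859
fitted_variance <- 31.69877
residual_variance <- 44.08713

# Fitted and residual variance add up to the total variance.
fitted_variance + residual_variance

# R^2 is the share of total variance explained by the model.
r_squared <- fitted_variance / total_variance
round(r_squared, 3)  # about 0.418

# Check the identity on simulated data (illustrative only).
set.seed(1)
x <- runif(50, min = 1, max = 50)
y <- 65 + 0.27 * x + rnorm(50, sd = 6)
toy_fit <- lm(y ~ x)
ratio <- var(fitted(toy_fit)) / var(y)
all.equal(ratio, summary(toy_fit)$r.squared)
```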
Source Code
---
title: "The Relationship Between Income and Life Expectancy"
subtitle: "Final Project"
author: "Hayley C., Felicia P., Ian L., Shane W."
format:
  html:
    embed-resources: true
    code-tools: true
    toc: true
    number-sections: true
editor: source
execute:
  error: true
  echo: false
  message: false
  warning: false
---

## Library Imports

```{r}
library(tidyverse)
library(dplyr)
library(gganimate)
library(plotly)
library(broom)
library(knitr)
library(kableExtra)
```

## Data

### Variables

### Daily Mean Income Household Per Capita

GapMinder compiled data on daily mean household income per capita to analyze income distributions over hundreds of years. These figures are anchored in the official Mean Income indicator from the World Bank, derived from household surveys. For countries lacking World Bank data, GapMinder estimated mean income based on GDP per capita. The available time frame for actual World Bank data spans from 1967 to 2021, though most countries have limited data points within these years.

GapMinder used growth rates of constant-dollar GDP per capita to estimate mean incomes back to 1800 and to project them forward to 2100. For the period 1981-2019, they relied on World Bank data, known for its comprehensive coverage, published in the World Development Indicators as "Survey mean consumption or income per capita, total population (2017 PPP dollars per day)." The World Bank Poverty and Inequality Platform (PIP) describes the indicator as "Survey mean/average consumption or income per capita, total population (2017 PPP dollars per day). The mean represents the average monthly household per capita income or consumption expenditure from the survey in 2017 PPP." Average consumption is not relevant to our project, but the World Bank collects it together with income per capita.

In short, the average daily income is the mean daily household per capita income or consumption expenditure from the survey, expressed in 2017 constant international dollars.

#### Life Expectancy

GapMinder collects life expectancy data from various sources to create a comprehensive dataset spanning from 1800 to 2100. Life expectancy at birth refers to the average number of years a newborn is expected to live, assuming that current mortality rates remain constant throughout their lifetime.

For the period from 1800 to 1970, GapMinder relies on its own compiled data (version 7), which includes information from over 100 sources and accounts for historical events causing significant mortality dips. From 1950 to 2019, data is primarily sourced from the Global Burden of Disease Study 2019 by the Institute for Health Metrics and Evaluation (IHME). This source provides detailed annual estimates. For projections from 2020 to 2100, GapMinder uses forecasts from the United Nations' World Population Prospects 2022. The data is carefully combined, prioritizing IHME data when available, and extending IHME series with UN estimates for future projections.

### Hypothesized Relationship Between the Variables

We hypothesize that average daily income is positively associated with life expectancy at birth.

```{r Load data}
average_daily_income_data <- read.csv("./mincpcap_cppp.csv")
life_expectancy_data <- read.csv("./lifeExpectancy.csv")
```

```{r Check Column Types}
average_daily_income_data |>
  summarise(across(.cols = everything(), .fns = typeof))

life_expectancy_data |>
  summarise(across(.cols = everything(), .fns = typeof))
```

```{r }
# convert every year column to numeric
average_daily_income_data <- average_daily_income_data |>
  mutate(across(-country, as.numeric))

life_expectancy_data <- life_expectancy_data |>
  mutate(across(-country, as.numeric))
```

### How the Data was Cleaned

To clean the data, we first checked the type of each column and found that the yearly values were stored as character strings even though they represent numbers. We therefore converted every year column to numeric.

The year column names had an X prefix when the data was first loaded, because read.csv() prepends X to column names that begin with a digit. We removed this prefix after pivoting the data so that the years are easy to reference when graphing.

Rather than dropping NA values at this stage, we kept them so that, when joining the data, we could decide which years or countries to use based on where the two data frames overlap.

```{r}
# one row per country-year; strip the "X" prefix from the year names
avg_di_long <- average_daily_income_data |>
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "Average Daily Income") |>
  mutate(Year = as.integer(str_remove(Year, "X")))

avg_le_long <- life_expectancy_data |>
  pivot_longer(cols = -country,
               names_to = "Year",
               values_to = "Average Life Expectancy") |>
  mutate(Year = as.integer(str_remove(Year, "X")))
```

### How the Data was Pivoted

Next, we pivoted each data set from wide to long format, keeping country as the identifier, so that each year becomes its own observation. Each row now holds one country, one year, and the corresponding value (average daily income or average life expectancy).

### How the Data was Joined

To create a single data table, we joined the two cleaned and pivoted data sets. We used an inner join on country and year, which keeps only the country-year pairs that appear in both data sets and drops pairs that are missing from either one. Values recorded as NA within a matching pair are not removed by the join.

```{r}
daily_income_and_life_expectancy <- avg_di_long |>
  inner_join(avg_le_long, join_by(country, Year)) |>
  rename(Country = country)
```

In addition to joining the data, we capitalized the name of the "country" column for consistency with the other variable names.

## Linear Regression

### Exploring the Relationship Between the Two Variables

The two variables explored are average daily income and average life expectancy. We examine how life expectancy varies with income.

The explanatory variable is the average income and the response variable is the average life expectancy.

```{r}
daily_income_and_life_expectancy |>
  ggplot(aes(x = `Average Daily Income`, y = `Average Life Expectancy`)) +
  geom_jitter(alpha = 0.7, color = "Steel Blue") +
  theme_minimal() +
  labs(title = "Relationship between Average Daily Income and Life Expectancy",
       x = "Average Daily Income",
       y = "",
       subtitle = "Average Life Expectancy at Birth")
```

To explore how the relationship changes over time, we animated the scatterplot by year.

```{r}
daily_income_and_life_expectancy |>
  plot_ly(
    x = ~`Average Daily Income`,
    y = ~`Average Life Expectancy`,
    text = ~Country,
    frame = ~Year,
    type = 'scatter',
    mode = 'markers',
    marker = list(size = 10, opacity = 0.7)
  ) |>
  layout(
    title = "Changes in Average Daily Income and Life Expectancy Over Time",
    xaxis = list(title = "Average Daily Income"),
    yaxis = list(title = ""),
    annotations = list(
      list(
        x = 0.5,
        y = 1.05,
        xref = "paper",
        yref = "paper",
        text = "Average Life Expectancy at Birth",
        showarrow = FALSE,
        font = list(size = 8)
      )
    ),
    font = list(size = 8)
  ) |>
  animation_opts(
    frame = 1000,     # milliseconds per frame
    transition = 0,   # duration of the transition between frames
    redraw = FALSE
  ) |>
  animation_slider(
    currentvalue = list(prefix = "Year: ")
  )
```

### Linear Regression

#### Steps to Choosing Regression Features

To simplify the linear regression, we restricted the data to a single year, 2010. Daily income and life expectancy have both changed substantially over the centuries, which makes it difficult to capture the full extent of these trends in a single regression model.

Historical data from the 1800s to the present day illustrates substantial shifts in both daily income and life expectancy, reflecting changes in economic, social, and healthcare systems globally.

By selecting the year 2010 as a reference point, we aim to focus on a period that represents a modern snapshot of these trends. Here's why 2010 is a good choice:

1. Representative Modern Era: 2010 serves as a representative point in the modern era, offering insight into contemporary socioeconomic and health conditions across countries.
2. Mitigation of Predicted Data: The later years of both series are projections rather than observations. Using 2010 keeps the analysis on observed data and avoids biases that projected values could introduce.
3. Adequate Time for Analysis: With 14 years having passed since 2010, this timeframe provides sufficient data for analysis while minimizing the impact of short-term fluctuations that may occur within smaller time intervals.

By anchoring our analysis to the year 2010, we aim to capture meaningful trends in daily income and life expectancy while ensuring the reliability and relevance of our linear regression model.

#### Regression Code

```{r}
# Code for Q4.
average_data_years <- daily_income_and_life_expectancy |>
  filter(Year == 2010) |>
  rename(daily_income_2010 = `Average Daily Income`,
         life_expectancy_2010 = `Average Life Expectancy`) |>
  select(Country, daily_income_2010, life_expectancy_2010)

average_data_years_lm <- lm(life_expectancy_2010 ~ daily_income_2010,
                            data = average_data_years)
average_data_years_lm
```

The fitted regression line is $\hat{y} = 0.2669x + 65.1162$, where $x$ is the daily income in 2010 and $\hat{y}$ is the predicted life expectancy in 2010.

#### Interpretation of Coefficients

Intercept (65.1162): The intercept is the estimated life expectancy in 2010 when daily income is zero. This value is an extrapolation, since a mean daily income of zero does not occur in practice, so it is best read as the baseline that anchors the regression line rather than as a prediction for a real country.

Daily Income Coefficient (0.2669): The slope is the estimated change in life expectancy associated with a one-unit increase in daily income. Here, each additional 2017 PPP dollar of daily income is associated with an increase of 0.2669 years in life expectancy, on average. Because the model has a single predictor, no other variables are held constant.

These interpretations provide insights into the relationship between daily income and life expectancy in the year 2010, as captured by the estimated regression model.

### Model Fit

```{r}
# get residuals from model
residuals <- augment(average_data_years_lm)

model_variance_summary <- average_data_years |>
  summarise(
    Total_Variance = var(average_data_years$life_expectancy_2010),
    Fitted_Variance = var(average_data_years_lm$fitted.values),
    Residual_Variance = var(residuals(average_data_years_lm))
  )

kable(model_variance_summary,
      caption = "Variance Summary of Regression Model",
      col.names = c("Total Variance", "Fitted Variance", "Residual Variance"))
```

The total variance in life expectancy is 75.79. Of this, 31.70 is explained by the model, which gives an $R^2$ of about 41.8% (31.70 / 75.79). The remaining 44.09 is unexplained.

Based on this $R^2$, the model quality is moderate to poor: the majority of the variability in life expectancy is not explained by daily income.

```{r}
rand_error <- function(x, mean = 0, sd){
  # add normally distributed random error to x
  x + rnorm(length(x), mean, sd)
}

noise <- function(x, mean = 0, sd){
  x + rnorm(length(x), mean, sd)
}
```

```{r}
# linear regression model
data_lm <- average_data_years_lm

# predict value
pred_expectancy <- predict(data_lm)

# get sigma
est_sigma <- sigma(data_lm)

# generate new fake data set
nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(pred_expectancy, sd = est_sigma))) |>
  rename_with(~ str_replace(., pattern = "\\.\\.\\.", replacement = "_"))

sims <- average_data_years |>
  filter(!is.na(life_expectancy_2010),
         !is.na(daily_income_2010)) |>
  select(life_expectancy_2010, daily_income_2010) |>
  bind_cols(sims)

suppressMessages(suppressWarnings({
  sim_r_sq <- sims |>
    map(~ lm(life_expectancy_2010 ~ .x, data = sims)) |>
    map(glance) |>
    map_dbl(~ .x$r.squared)
  sim_r_sq <- sim_r_sq[names(sim_r_sq) != "life_expectancy_2010"]
}))
```

```{r}
tibble(sims = sim_r_sq) |>
  ggplot(aes(x = sims)) +
  geom_histogram(binwidth = 0.025) +
  labs(x = expression("Simulated" ~ R^2),
       y = "",
       title = expression("Simulated " ~ R^2 ~ "Values versus the Observed" ~ R^2),
       subtitle = "Number of Simulated Models") +
  theme_bw()
```